In phonetics and speech science research, the R programming environment is commonly used for curating data and performing a vast array of statistical analyses. However, given the R language’s history of and focus on statistics—“R is a free software environment for statistical computing and graphics” (www.r-project.org)—it is not often used as an environment for primary data analysis. A typical workflow might consist of analyzing data in another language such as MATLAB or Python and subsequently importing the processed data into R for statistical treatment. But this doesn’t have to be the case: for many of the analyses you might want to perform in one of those languages, you can do a similar analysis in R.
If you primarily want to do matrix calculations (such as for some types of image analysis), then a language like MATLAB (GNU Octave) or Julia might suit you better than R. If you want to do web scraping or object-oriented programming, then a language like Python or JavaScript might suit you better than R. But if you want to do some more basic types of primary analysis of speech data, and if you already use R for your data curation and statistical treatment, then this tutorial might just convince you of the merits of keeping your entire workflow within the R environment. Not only will you potentially save time by focusing your efforts on one language, but there will be fewer opportunities for mistakes to arise in transferring data from one environment to another, and your workflow can be more easily replicated (both by yourself and by other researchers).
The data used in this two-part tutorial comes from a much larger tutorial that focuses on several different types of speech data: speech acoustics, aerodynamics, electromagnetic articulometry (EMA), ultrasound, and real-time magnetic resonance imaging (rt-MRI). However, the current tutorial focuses only on acoustic and ultrasound data analysis within the R environment. Therefore, some of the data sets used here will include data types that won’t actually be used or covered in this tutorial.
If you have used R before—and, since this tutorial is not recommended for R novices, you likely have—you’ll be aware of the sheer number and diversity of packages that are available. Due both to its age (built as an implementation of S, R is one of the oldest programming languages developed by academics and statisticians) and to its open-source nature (anyone can contribute to R!), R is one of the richest ecosystems currently available for data analysis. However, I personally believe that in order to best understand a programming language (and, therefore, understand what you can do with it) it is often best to keep as close to the base language as possible, rather than relying on packages. For the purposes of this tutorial, I have attempted to strike a balance between the base R language and packages that are an extension to that language. In some cases (like loading and playing audio data, or converting image matrices to rasterized images), the nature of the underlying R code involved is well beyond the theme of this tutorial; in these cases, we will rely on packages rather than doing it ourselves. In other cases (like generating spectrograms), the nature of the underlying R code involved is so extensive that it merits an entirely separate tutorial on its own; in these cases, we will rely on packages as well. But for most of the work that we will cover here, the base R language will be used. It is my hope that this will not only help you better understand the analyses that are involved, but that it will also help you come up with new ideas for your own research (and learn the tools that you can use to implement them in the process).
One final note: although you should be able to complete the entire tutorial using an R session in the terminal, I highly recommend that you use the RStudio integrated development environment instead. You can keep a script open to copy and paste bits of code from this tutorial, you will have a better understanding of the workspace data available to you at any given time, and you can compare figures more easily by clicking through the history in the Plots tab of the “Files, Plots, Packages, Help, Viewer” pane.
Now… without further ado… let’s get to it!
The following five packages will need to be installed (using install.packages('PACKAGENAME')) before using this tutorial. If you don’t have any of these installed yet, you can install them all in one go using:
install.packages(c("RCurl","sound","phonTools","raster","ggplot2"))
You can load the packages (using either library(PACKAGENAME) or require(PACKAGENAME)), but my preference is to call the package namespace explicitly when using a function (similar to using packages in Python). This is because some packages will have the same names for some functions (e.g., sound::play() and phonTools::play(), or base::as.matrix() and raster::as.matrix(), have the same function name but sometimes have different function requirements), and which function will be available to use as the default depends on the order in which the packages were loaded. Calling the package namespace explicitly will avoid potential issues that may occur if you just so happen to load packages in a different order than I did when making this R Markdown document.
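As a quick illustration of this pattern, the double-colon syntax works with any installed package, whether or not it has been loaded (here using only base R and stats, so no extra packages are needed):

```r
# Calling a function through its namespace with `::` pins down exactly
# which package's version runs, independent of load order.
x <- c(2, 4, 6, 8)
stats::sd(x)    # explicitly the sd() from the stats package
base::mean(x)   # explicitly the mean() from base R; returns 5
```

The same pattern disambiguates real clashes like `sound::play()` vs. `phonTools::play()` once both packages are installed.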
The only package we will load specifically (and will use its functions without specifying the namespace) is ggplot2, because none of its functions conflict with other packages (as far as I’m aware):
require(ggplot2)
All of the data sets used in this tutorial can be accessed (indefinitely!) from a Dropbox directory I’ve set up. That way, you can go through the material at your own pace, whenever you want, on any computer you want, and even share this HTML tutorial document with others if you find it helpful… all without needing to keep and/or share local copies of the data.
In order to access these files, we’ll first have to create a small function that we’ll use whenever we need to download a particular data set:
dl_from_dropbox <- function(x, key) {
  bin <- RCurl::getBinaryURL(
    paste0("https://dl.dropboxusercontent.com/s/", key, "/", x),
    ssl.verifypeer = FALSE
  )
  con <- file(x, open = "wb")
  writeBin(bin, con)
  close(con)
}
This function will download the associated data file from my Dropbox folder to your local working directory in R. You can then load it from there like any other .Rda file you might have lying around and follow along with all of the bits of the tutorial. Nifty, huh?
One of the disadvantages of using R for large-scale data analysis is that R has a tendency to hold on to memory. Many times, even if you clear variables from the workspace, R will retain a portion of the memory allocated to those variables. If you ever find yourself short on RAM as you work through the tutorial, there are a number of things you can do.
If you’re done working with a data set, and you’re moving on to the next one, it’s likely (although not certain, I haven’t double-checked everything in the entire tutorial!) that you won’t need any of the objects from the previous section except for the dl_from_dropbox() function. You can clear your workspace of all data except for that function using this code:
rm(list=setdiff(ls(), "dl_from_dropbox"))
If that doesn’t free up enough memory for you, you can do a “garbage collection”, which will free up some of the unused memory:
gc()
If you still find yourself short on RAM, you can do a full restart of the R session without touching the data in your workspace:
.rs.restartR()
Since we’re interested in researching speech, it’s generally a good idea with these data sets to have a listen to what we’re dealing with before getting our hands dirty with manipulating and analyzing the data. To do so, we’ll first have to tell R how to interface with the computer audio using the sound::setWavPlayer() function, which we’ll only have to do once for the session. If you’re using Windows, you may not have to do this part. In any case, here are some potential defaults to try for different OSes:
sound::setWavPlayer('/bin/aplay')
sound::setWavPlayer('/bin/mplayer')
sound::setWavPlayer('C:/Program Files/Windows Media Player/wmplayer.exe')
sound::setWavPlayer('/Applications/"QuickTime Player.app"/Contents/MacOS/"QuickTime Player"')
NB: if you can’t get this working on your computer, it’s not the end of the world: you’ll still be able to do all of the tutorial exercises.
For exploring acoustic analyses, we’ll use audio recordings from a recent set of data that I collected for researching the respective roles of the lateral pterygoid (LPT) and medial pterygoid (MPT) muscles in the rotation and translation of the jaw during speech.
FYI: for this study, hooked wires needed to be placed inside the jaw muscles using a hollow needle.
Yes, it was as uncomfortable as it looks. No, I didn’t get any money for it. Yes, I would definitely do it again!
Let’s start by loading the data set that contains the audio we’ll use in this section:
dl_from_dropbox('EMA-EMG_data.Rda', 'cdzc6vcjnn8ppqw')
load('EMA-EMG_data.Rda')
And now let’s take a peek at what’s inside, shall we?
summary(EMA.EMG)
## Length Class Mode
## SR 3 -none- list
## audio 346925 -none- numeric
## LPT 140469 -none- numeric
## MPT 140469 -none- numeric
## EMA 11 data.frame list
## segments 4 data.frame list
We can see that there are six fields in the data set: SR, audio, LPT, MPT, EMA, and segments. The first field is a list of the sampling rates for the three different data types (audio, EMG, and EMA):
EMA.EMG$SR
## $audio
## [1] 22050
##
## $EMG
## [1] 8928
##
## $EMA
## [1] 100.0912
We will need to access this information later in order to relate points in time (seconds) to points in the audio data (samples). The last field in the main data structure contains segmentation info which we’ll use later. Let’s explore the data a bit by having a listen to the first speech sample. We’ll first save the audio sampling rate to a variable, because we’ll be using it a lot in this section:
sr <- EMA.EMG$SR$audio
audio <- sound::as.Sample(as.vector(EMA.EMG$audio), fs=sr)
sound::play(audio)
For acoustic visualization and measures, we’ll be using the excellent (and extensive!) phonTools R package created by Santiago Barreda. We’ve truly got everything (including the proverbial kitchen sink) in this package here, so in this tutorial we’ll just be scratching the surface of the tool set available in the package. It’s definitely worth the time to explore the tools on your own!
One thing to note about this part of the tutorial: I’m a pretty firm believer that it’s best to take the time to understand a black box before using it (although I will be the first to admit that I don’t do this myself as often as I should). For many people, formant analysis (especially formant tracking over time) is a black box. So we’re going to focus on formant analysis in this tutorial in order to better understand how to work with acoustic data in R more generally. In doing so, we’re going to reinvent the wheel a bit before we actually get to the wheel, with the goal of better understanding the R language and environment itself, and also of opening the black box of formant analysis to shed a bit of light inside.
Let’s start things off by creating one of our favorite visualizations for speech acoustics: the trusty spectrogram. To keep things simple, we’ll use the audio from the EMA-EMG data set that we’ve just used in the previous exercise. Using the phonTools::spectrogram() function, we simply need to supply the audio data and sampling rate:
phonTools::spectrogram(EMA.EMG$audio, fs=sr)
NB: you may get some “the condition has length > 1” warnings here, which you can ignore.
If you’ve generated spectrograms in MATLAB or Python before, you may have seen a default color scheme like this one; it’s a color scheme that works well for differentiating levels in a continuous range (like the dB range of a spectrogram). But if you’ve only worked with spectrograms in Praat, you may find it a bit weird, in which case you can set the colors argument to FALSE or simply F if you want to get a more familiar grayscale representation:
phonTools::spectrogram(EMA.EMG$audio, fs=sr, colors=F)
We’ve obviously got several different words here (which you may also remember from listening to the audio earlier), so we’ll focus on only one word when playing around with the spectrogram. Let’s first take a look at EMA.EMG$segments to see what we’re dealing with here:
EMA.EMG$segments
## word phone start end
## 1 asas@ a 1.540144 1.706447
## 2 asas@ s 1.706447 1.917608
## 3 asas@ a 1.917608 2.105793
## 4 asas@ s 2.105793 2.293978
## 5 asas@ @ 2.293978 2.477787
## 6 olol@ o 3.487426 3.665764
## 7 olol@ l 3.665764 3.787209
## 8 olol@ o 3.787209 4.015876
## 9 olol@ l 4.015876 4.138415
## 10 olol@ @ 4.138415 4.305812
## 11 epep@ e 5.399845 5.522384
## 12 epep@ p 5.522384 5.710570
## 13 epep@ e 5.710570 5.851708
## 14 epep@ p 5.851708 6.001600
## 15 epep@ @ 6.001600 6.185409
## 16 afaf@ a 7.245524 7.395416
## 17 afaf@ f 7.395416 7.573754
## 18 afaf@ a 7.573754 7.779445
## 19 afaf@ f 7.779445 7.943560
## 20 afaf@ @ 7.943560 8.138310
## 21 {s{s@ { 9.271109 9.436318
## 22 {s{s@ s 9.436318 9.619033
## 23 {s{s@ { 9.619033 9.821441
## 24 {s{s@ s 9.821441 9.981180
## 25 {s{s@ @ 9.981180 10.152954
## 26 {r{r@ { 11.733739 11.900042
## 27 {r{r@ r 11.900042 12.067439
## 28 {r{r@ { 12.067439 12.325647
## 29 {r{r@ r 12.325647 12.445998
## 30 {r{r@ @ 12.445998 12.604643
## 31 itit@ i 13.669756 13.793389
## 32 itit@ t 13.793389 14.045032
## 33 itit@ i 14.045032 14.180701
## 34 itit@ t 14.180701 14.365604
## 35 itit@ @ 14.365604 14.563636
The first word “asas@” (SAMPA convention) spans rows 1 through 5 here, so let’s use the start and end time points for the first and last segments of that word to generate a new spectrogram. Let’s grab those two time points and assign them to some variables:
t1 <- EMA.EMG$segments$start[1]
t2 <- EMA.EMG$segments$end[5]
t1; t2
## [1] 1.540144
## [1] 2.477787
Okay, so we’ve got a column for word, a column for individual phones, and columns for the time points corresponding to the beginning and end of those phones. These time points are in seconds. But we have a small problem: how do we relate these time points to specific samples in the audio data? We have to use the sampling rate. Recall that sampling rate is defined as the number of data samples recorded per second. So, given a time point in seconds, we can find the specific sample by simply multiplying the time point by the sampling rate, e.g.,
t1 * sr
## [1] 33960.16
Whoops! This is problematic. There is no sample number 33960.16. Samples must be whole integers because they are discrete units. So let’s just round these samples when we assign them to the variables:
t1 <- round(EMA.EMG$segments$start[1] * sr)
t2 <- round(EMA.EMG$segments$end[5] * sr)
t1; t2
## [1] 33960
## [1] 54635
Okay, that’s more like it. The beginning of the word “asas@” corresponds to sample number 33960 in the audio data, and the end of the word corresponds to sample number 54635 (aren’t you glad we assigned these numbers to variables so we don’t have to remember them??). Let’s use these samples to generate a new spectrogram that only includes this word. While we’re at it, let’s also change the dynamic range from the default (50 dB) to something like 65 dB, in order to better approximate a spectrogram that you’d see in Praat:
phonTools::spectrogram(EMA.EMG$audio[t1:t2], fs=sr, colors=F, dynamicrange=65)
There are a few different arguments that you can play with here (use ? phonTools::spectrogram to investigate), such as changing the frequency range:
phonTools::spectrogram(EMA.EMG$audio[t1:t2], fs=sr, colors=F, dynamicrange=65, maxfreq=8000)
Hey, look at that: we can see the fricatives now!
We’ve made a pretty spectrogram, but… so what? What’s the point here for research beyond visualization purposes? Well, what do you use a spectrogram in Praat for? Lots of stuff!
Let’s take a simple scenario: measuring average formant values within some selected time interval. We already have the time points of the individual sounds in EMA.EMG$segments, but let’s say that we didn’t have these time points already and we want to find them by clicking on the spectrogram (like you would in Praat). To do this in R, we’ll use a nifty little function from base R called locator(). This function will allow you to click on whichever plot is currently displayed and return values from the \(x\)- and \(y\)-axes of that plot. It’s really handy for doing analyses like this.
Let’s first make sure again that you have the correct spectrogram showing in the Plots tab of RStudio:
phonTools::spectrogram(EMA.EMG$audio[t1:t2], fs=sr, colors=F, dynamicrange=65, maxfreq=8000)
Now use the locator() function and assign the results to a variable (which we’ll call coords). Once you’re done clicking, either hit the “STOP” button in the Console window, or if that doesn’t work (it doesn’t on my computer), hit “Esc” on your keyboard instead. Alternatively, if you know ahead of time that you’re only going to click a specific number of times (two times, in our case), you can provide a number to the function (2, in our case), which will then stop automatically after you have clicked that many times:
coords <- locator(2)
Let’s try to segment the second vowel in the word. After clicking twice on the above spectrogram (once at the beginning and once at the end of the vowel) I get the following results:
coords
## $x
## [1] 389.8630 551.4366
##
## $y
## [1] 2871.832 2944.500
What we’re really interested in here are the two points shown in $x, since time is displayed on the \(x\)-axis in the spectrogram plot. Keep in mind that the units are the same as those that are shown in the plot; so in this case, these are temporal values in ms. By converting to seconds and converting to samples, we can make a new spectrogram of just that second vowel in the word.
t1 <- round(coords$x[1]*sr/1000)
t2 <- round(coords$x[2]*sr/1000)
phonTools::spectrogram(EMA.EMG$audio[t1:t2], fs=sr, colors=F, dynamicrange=65, maxfreq=8000)
WAIT… hang on… that looks awful! What gives here? This doesn’t look like a vowel at all, and why did the time window reset? Didn’t it start at 0 ms in the previous spectrogram as well?
Unlike when viewing different temporal intervals of the audio signal in Praat, this spectrogram function from phonTools won’t show you the time relative to the start of the file; it will only show you the time relative to the start of the selection used to generate the spectrogram. In other words, the beginning of the \(x\)-axis will always be time = 0. So we first need to add the temporal offset between the start of the audio file and the beginning of the selection. As a reminder, this offset is the beginning of the word:
offset <- EMA.EMG$segments$start[1]
offset
## [1] 1.540144
This is the beginning of the spectrogram for the entire word (relative to the start of the audio file). That was the spectrogram that was showing when we clicked with the locator() function and got these coordinates:
coords
## $x
## [1] 389.8630 551.4366
##
## $y
## [1] 2871.832 2944.500
So this means that the vowel begins 389.86 ms after the start of the word, which itself begins 1.54 s after the start of the audio file. Aha! So we have to add those time points together to get the beginning of the vowel relative to the start of the audio file. We need to convert the coordinates from ms to s, then add them to the offset, then convert to samples, then round to the nearest whole integer sample. Did you follow that? If not, try to work through that reasoning in your head before moving on.
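To make that reasoning concrete, here is the same arithmetic worked through for just the first coordinate, using the numbers from above (389.8630 ms from the locator() results, the 1.540144 s offset, and our 22050 Hz sampling rate):

```r
ms   <- 389.8630                 # vowel onset, relative to the start of the word (ms)
secs <- 1.540144 + ms / 1000     # ...converted to s and made relative to the file start
round(secs * 22050)              # ...converted to the nearest whole sample: 42557
```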
This will take a bit of wrangling, so let’s save the two time points as variables so that we don’t get too confused later on:
t1 <- round(
  (offset + coords$x[1]/1000) * sr
)
t2 <- round(
  (offset + coords$x[2]/1000) * sr
)
phonTools::spectrogram(EMA.EMG$audio[t1:t2], fs=sr, colors=F, dynamicrange=65, maxfreq=8000)
(insert animated GIF of The Fonz)
“Heyyy…” now that looks more like it…
So now we have verified that the time points of the beginning and end of the vowel are samples that are relative to the start of the audio signal. We can now use the phonTools::findformants() function to get the average formant values (based on LPC modeling) within that time interval (i.e. an average spectral slice of the whole vowel):
phonTools::findformants(EMA.EMG$audio[t1:t2], fs=sr)
There’s a lot to take in here, but let’s walk through what information we get. In the left plot, we see the average spectrum within the analysis window and the LPC curve overlaid in solid black. The vertical dotted black lines are formant measures that were found but then rejected, and the vertical solid colored lines are the “winning” formant candidates. The first three formants are shown by the red line (F1), the green line (F2), and the blue line (F3) on the left. In the right plot, we see the location of those formants (i.e. LPC poles) within the two-dimensional space created by the real and imaginary parts of the FFT complex numbers.
It might also be helpful for your particular analysis (or as a diagnostic test) to view the formant bandwidths (showbws=T), which are easier to see if you turn off the rejected candidates (showrejected=F):
phonTools::findformants(EMA.EMG$audio[t1:t2], fs=sr, showbws=T, showrejected=F)
Finally, if you want the actual numbers rather than just looking at a plot, you can save the output to a variable (and ignore the plot display itself to save time by setting verify=F):
formants <- phonTools::findformants(EMA.EMG$audio[t1:t2], fs=sr, verify=F)
formants
## formant bandwidth
## 1 825.76 187.3241
## 2 1397.33 180.2710
## 3 2786.89 247.4086
## 4 3712.67 134.2290
## 5 4121.18 212.0571
## 6 7924.99 514.0352
## 7 9531.92 426.6678
So the function gives 826 Hz and 1397 Hz for F1 and F2, respectively. If you recall, the analysis window under consideration here is the second [a] of “asas@”, so these formant values seem pretty accurate!
So now we’re at the point of being able to quantify formants for an average spectral slice. A logical next step might be to do the same thing for a whole bunch of vowels and create a vowel space. So let’s do it!
This time, instead of clicking to get the time points for each vowel in our data set (which you are able to do yourself now, if you wanted to), we’ll make our job a bit easier by just using the time points that are already provided in EMA.EMG$segments:
EMA.EMG$segments
## word phone start end
## 1 asas@ a 1.540144 1.706447
## 2 asas@ s 1.706447 1.917608
## 3 asas@ a 1.917608 2.105793
## 4 asas@ s 2.105793 2.293978
## 5 asas@ @ 2.293978 2.477787
## 6 olol@ o 3.487426 3.665764
## 7 olol@ l 3.665764 3.787209
## 8 olol@ o 3.787209 4.015876
## 9 olol@ l 4.015876 4.138415
## 10 olol@ @ 4.138415 4.305812
## 11 epep@ e 5.399845 5.522384
## 12 epep@ p 5.522384 5.710570
## 13 epep@ e 5.710570 5.851708
## 14 epep@ p 5.851708 6.001600
## 15 epep@ @ 6.001600 6.185409
## 16 afaf@ a 7.245524 7.395416
## 17 afaf@ f 7.395416 7.573754
## 18 afaf@ a 7.573754 7.779445
## 19 afaf@ f 7.779445 7.943560
## 20 afaf@ @ 7.943560 8.138310
## 21 {s{s@ { 9.271109 9.436318
## 22 {s{s@ s 9.436318 9.619033
## 23 {s{s@ { 9.619033 9.821441
## 24 {s{s@ s 9.821441 9.981180
## 25 {s{s@ @ 9.981180 10.152954
## 26 {r{r@ { 11.733739 11.900042
## 27 {r{r@ r 11.900042 12.067439
## 28 {r{r@ { 12.067439 12.325647
## 29 {r{r@ r 12.325647 12.445998
## 30 {r{r@ @ 12.445998 12.604643
## 31 itit@ i 13.669756 13.793389
## 32 itit@ t 13.793389 14.045032
## 33 itit@ i 14.045032 14.180701
## 34 itit@ t 14.180701 14.365604
## 35 itit@ @ 14.365604 14.563636
The first problem we come across is that we have time points for all of the sounds, including consonants, which we don’t want: we only want the time points for the vowels. The vowels we have in this mini data set are “a”, “e”, “i”, “o”, “{” and “@” (SAMPA conventions). The first thing we need to do is see how many individual vowel sounds are in our data set by asking the question: how many (sum) individual sound segments (EMA.EMG$segments$phone) match one of (%in%) the symbols in our vowel list (vowels)?
vowels <- c("a","e","i","o","{","@")
vowelnum <- sum(EMA.EMG$segments$phone %in% vowels)
vowelnum
## [1] 21
So we’ve got 21 vowels in the data set. That sounds right, because we’ve got seven tri-syllabic words in the data set:
length(unique(EMA.EMG$segments$word))
## [1] 7
Our goal for this exercise is to end up with a table that has F1 and F2 values for each of these 21 vowels. So the table needs to have 21 entries, and three dimensions (vowel, F1, F2) for each one. We can pre-allocate a data frame accordingly:
form.dat <- data.frame(
  vowel = character(vowelnum),
  f1 = numeric(vowelnum),
  f2 = numeric(vowelnum)
)
Now we’re going to loop through the list of segments, and if we come across a segment that is a vowel, we’ll make some formant measures and add them to the table. But there’s a small problem for this loop: there are 35 rows in EMA.EMG$segments, and only 21 of those are vowels, but we don’t know ahead of time which 21 rows of the total 35 are associated with the vowels. This is a situation where I personally like to use forced iteration: I start with a variable given a value of 1, and then manually increase the value of that variable by 1 if the conditions for taking the measurement are met.
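If forced iteration is new to you, here is a minimal standalone sketch of the pattern with made-up numbers (collecting the squares of just the even values), before we apply it to the real segments below:

```r
values <- 1:10
out <- numeric(sum(values %% 2 == 0))  # pre-allocate space for the matches only
x <- 1                                 # manually managed output index
for (v in values) {
  if (v %% 2 == 0) {                   # condition met: take the "measurement"...
    out[x] <- v^2
    x <- x + 1                         # ...and advance the index by hand
  }
}
out  # 4 16 36 64 100
```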
There’s another bit we’ll add, since it will likely be a methodological step of interest to many of you: instead of taking the average formant measurements across the entire vowel interval, we’ll take these measurements between 20% and 80% of the vowel interval, in order to avoid effects due to coarticulation with any surrounding consonants.
So let’s start with a description of what we’re going to be doing here, and you can hopefully match up the description with the code below:
We start with the counter variable x given a value of 1; inside the loop, x will be increased by 1 each time a measurement is taken:

x <- 1
for (phone in 1:nrow(EMA.EMG$segments)) {
  if (EMA.EMG$segments$phone[phone] %in% vowels) {
    t1 <- round(EMA.EMG$segments$start[phone]*sr)
    t2 <- round(EMA.EMG$segments$end[phone]*sr)
    dur <- t2 - t1
    t20 <- t1 + round(dur*0.2)
    t80 <- t1 + round(dur*0.8)
    formants <- phonTools::findformants(EMA.EMG$audio[t20:t80], fs=sr, verify=F)
    form.dat$vowel[x] <- EMA.EMG$segments$phone[phone]
    form.dat[x, c("f1","f2")] <- formants$formant[1:2]
    x <- x+1
  }
}
par(mfrow=c(1,1))
ggplot(form.dat, aes(x=f2, y=f1, col=vowel, shape=vowel)) + geom_point(cex=3) +
  scale_y_reverse() + scale_x_reverse() + theme_bw()
That’s a pretty good looking little vowel space, if I do say so myself!
At this point, we’ve made some spokes for the wheel we’re re-inventing. But now let’s put a rubber tire on it and see if it will get us down the road a bit.
The basic idea with formant tracking is that you want to measure how formants change over time. Pretty simple, right? Well… what do we consider to be time? On the one hand, you can’t take formant measurements at every audio sample: FFT decomposition assumes that you have enough samples of the waveform that you can capture the individual sinusoidal components. On the other hand, you can’t take formant measurements of the entire audio signal: you would only get a single set of formant measures, which defeats the purpose of formant tracking. So the “window” where you take the measurements can’t be too small or too large: windows of a few milliseconds are generally used. For our exercise, we’ll use a 10 ms window:
winlen <- 0.01
We then need to convert this time window (in seconds) to a number of audio samples (winlen), and then determine how many full windows (bins) will fit in the audio recording:
winlen <- round(sr*winlen)
bins <- floor(length(EMA.EMG$audio)/winlen)
If we were doing an actual, full, real formant tracking analysis here, this would be a bit more sophisticated: the windows would overlap (generally by 50%), and the last remaining samples of the audio file (those that remain after the complete number of full windows/bins) would also be analyzed separately. But for the sake of simplicity, we won’t do those things here.
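For the curious, here is one way the start indices of 50%-overlapping windows could be generated; the sizes here are made up for illustration, and this is not part of our simplified analysis:

```r
winlen <- 220                      # samples per window (hypothetical)
n      <- 10000                    # total samples in the signal (hypothetical)
hop    <- winlen / 2               # 50% overlap = advance half a window each step
starts <- seq(1, n - winlen + 1, by = hop)
head(starts)                       # 1 111 221 331 441 551
length(starts)                     # number of (overlapping) windows that fit
```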
Let’s take a look at what the first audio bin will look like:
chunk <- EMA.EMG$audio[1:winlen]
plot(chunk, type='l')
The idea here is that we will do a formant analysis on this first bin of the signal, then move on to the second bin and do a formant analysis there, then move on to the third bin, etc. until the end of the file. However, cutting up the audio signal in this way can sometimes result in strange FFT decomposition (and, therefore, strange formant values) due to the signal being cut off abruptly at the beginning and at the end. Because of this, a tapered window of some sort is usually applied to the segmented audio before doing the analysis, in order to slope the samples downward toward exactly 0 at the beginning and end of the signal. A Hanning window is a very common window shape (also referred to as a kernel) to use.
A Hanning window is defined by the function below:
hannwin <- ( 0.5 - (0.5 * cos(2*pi*(0:(winlen-1))/(winlen-1))) )
plot(hannwin)
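As a quick sanity check on the hand-rolled window, it should be exactly 0 at both ends and reach exactly 1 in the middle (here using a hypothetical odd length of 221 samples, so that the peak lands precisely on a sample):

```r
n <- 221                                              # hypothetical window length
w <- 0.5 - 0.5 * cos(2 * pi * (0:(n - 1)) / (n - 1))  # same Hann formula as above
c(first = w[1], mid = w[(n + 1) / 2], last = w[n])    # endpoints 0, peak 1
```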
Conveniently, the phonTools package also includes a windowing function that we can use instead! By multiplying the signal and the kernel, we get a tapered version of the segmented audio:
hannwin <- phonTools::windowfunc(winlen, type="hanning")
plot(hannwin*chunk, type='l')
That’s better! Analyzing an audio segment like this will give more reliable results.
So now the idea is to do this exact same thing for all of the separate audio segments that fit within the whole recording. We’ll accomplish that by simply looping through each segment (bin) of all the segments (bins) and repeating the same process as above. Additionally, we’ll keep track of time by logging the time point (in seconds) of the middle of each audio segment. See if you can follow all of that in the code below:
formant.tracks <- data.frame(time=numeric(bins),
                             f1=numeric(bins),
                             f2=numeric(bins),
                             f3=numeric(bins))
for (bin in 1:bins) {
  t1 <- (bin-1)*winlen + 1
  t2 <- bin*winlen
  chunk <- EMA.EMG$audio[t1:t2]
  formants <- phonTools::findformants(hannwin*chunk, fs=sr, verify=F)
  midpoint <- mean(c(t1,t2))/sr
  formant.tracks[bin,] <- c(midpoint, formants$formant[1:3])
}
One thing that linear predictive coding doesn’t have that you do is high-level knowledge of speech acoustics: LPC is dumb and you are not. Some formant values are simply not at all reasonable to ever expect, no matter what the LPC results give. There are more sophisticated ways to deal with this (such as shifting entire formant measure sets up or down if you suspect that a particular formant has been skipped), but one easy way to deal with this is to simply remove any phonetically unreasonable values from the results by assigning them NA values or setting them to 0:
formant.tracks$f1[formant.tracks$f1 < 250 | formant.tracks$f1 > 1200] <- NA
formant.tracks$f2[formant.tracks$f2 < 800 | formant.tracks$f2 > 2800] <- NA
formant.tracks$f3[formant.tracks$f3 < 2500 | formant.tracks$f3 > 4000] <- NA
So how does our wheel look now that we’ve completely re-invented it? Let’s find out by overlaying the formant measures on top of the spectrogram of a section of the audio. Let’s first take some time points, and extract a section of the formant tracks that lie between those two points:
t1 <- 5.3
t2 <- 6.3
formant.chunk <- formant.tracks[formant.tracks$time>=t1 & formant.tracks$time<=t2,]
Now let’s plot the spectrogram for the audio between those two time points, and then overlay the formant data:
NB: the ylim values c(180,4820) probably appear strange, but due to automatic padding that R sometimes applies to plots these are the correct values for fitting the formant tracks exactly into a 0-5000 Hz window.
phonTools::spectrogram(EMA.EMG$audio[round(t1*sr):round(t2*sr)], fs=sr, colors=F)
par(new=T,mfrow=c(1,1))
plot(formant.chunk$time - t1, formant.chunk$f1,
     col=1, pch=16, cex=0.75,
     xaxt='n', yaxt='n', xlab=NA, ylab=NA, ylim=c(180,4820))
points(formant.chunk$time - t1, formant.chunk$f2,
       col=2, pch=16, cex=0.75)
points(formant.chunk$time - t1, formant.chunk$f3,
       col=3, pch=16, cex=0.75)
Any car that we build using this wheel probably won’t win any races, but it will certainly get us to the grocery store and back.
Thankfully, you don’t have to re-invent the wheel every time you want to do formant tracking. Let’s see now what the phonTools::formanttrack() function gives us for the same window length (and also 0% window overlap, as we have just done):
phonTools::formanttrack(EMA.EMG$audio[round(t1*sr):round(t2*sr)], formants=3,
fs=sr, windowlength=10, timestep=10)
Yeah, a bit better. But Santiago Barreda has spent a lot of time and effort really fine-tuning his wheel, so let’s stop imposing our own window length and (lack of) window overlap, and just see how his wheel performs on its own:
phonTools::formanttrack(EMA.EMG$audio[round(t1*sr):round(t2*sr)], formants=3, fs=sr)
We can see where some errors have been made at the very edges of the vowels, but all in all, this wheel seems to work really well! And again, we can extract the results (formant values and time points) by assigning the output to a variable, which you can use for subsequent analysis:
NB: remember that the time stamps (ms) in the results are relative to the beginning of the segment that is analyzed, here t1 = 5.3 seconds.
formants <- phonTools::formanttrack(EMA.EMG$audio[round(t1*sr):round(t2*sr)], formants=3, fs=sr, show=F)
head(formants)
## time f1 f2 f3
## 1 30.1 597.88 3153.11 0.00
## 2 35.1 590.03 3139.91 0.00
## 3 40.1 572.97 1802.26 3194.94
## 4 75.1 579.05 1572.22 3035.95
## 5 80.1 585.22 3056.71 0.00
## 6 85.1 582.88 3130.10 0.00
We’re going to move away from audio data now and turn to a type of speech data that comes with its own set of challenges: vocal tract images. In this tutorial, we’ll only be looking at ultrasound images of the tongue surface, but the larger tutorial also covers some analyses of magnetic resonance (MR) images of the entire vocal tract.
NB: there are existing R packages for working with ultrasound data, such as the rticulate package created by Stefano Coretta. However, for the purposes of this tutorial, we’ll be building some tools and methods from the ground up (like we did for formant analysis) in order to better understand the data and how to work with the data in the R environment.
The ultrasound data set comes from a recent study where I collected audio, ultrasound, nasalance, and electroglottographic data from (Southern) French speakers, and then used the audio as stimuli to present to (Australian) English listeners who weren’t familiar with French. Their task was to simply imitate what they heard while I collected the same acoustic and articulatory data. Let’s load the data (this will probably take a bit longer than the previous data set because now we’re dealing with images):
dl_from_dropbox('ultrasound_data.Rda', '68nfc5yie653kf7')
load('ultrasound_data.Rda')
summary(US.data)
## Length Class Mode
## SR 2 -none- list
## data1 3 -none- list
## data2 3 -none- list
So in this data set, we actually have two separate smaller data sets: US.data$data1 and US.data$data2. In this first section, we’ll be using data1, and we’ll switch to data2 in Part #2 of the tutorial.
Let’s have a listen to the speech sample for the first part of the data set, which is one of the naive listeners imitating the French word “pain”:
sr <- US.data$SR$audio
audio <- sound::as.Sample(as.vector(US.data$data1$audio), sr)
sound::play(audio)
Unlike the previous data we’ve been working with, images (grayscale images, that is) are usually kept in three dimensional arrays, with the first two dimensions (rows and columns) corresponding to the image dimensions (height and width, respectively), while the third dimension (pages) corresponds to the individual images/frames. Three dimensional arrays of this sort are often referred to as image “stacks”, since the two-dimensional images are stacked in sequential order along the third dimension of the array. The elements of the third dimension are referred to as “pages”, since they are like paper pages (which, individually, are two dimensional) stacked together into a book (which is then three dimensional).
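To make the “stack” idea concrete, here’s a toy example (not the ultrasound data itself):

```r
# A toy "stack" of two 2x3 grayscale frames along the third dimension
stack <- array(1:12, dim = c(2, 3, 2))
dim(stack)            # 2 3 2: rows, columns, pages (frames)
frame2 <- stack[,,2]  # all rows and all columns of the second frame
dim(frame2)           # 2 3
```

Indexing with [ , , N] pulls out page N as an ordinary two dimensional matrix.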
So, in order to access all of the pixels of an individual image, we must include all of the rows and all of the columns for that particular image, i.e. indexing [ , , N] for the Nth image in the array. We can view one of the ultrasound image frames with the image() function. Let’s view the first image of our image stack…
NB: dealing with graphics in R can be more frustrating than in other languages/environments, such as MATLAB. An example here is that there isn’t a simple way to view an image well in grayscale without specifying the grayscale color map explicitly. Since we’ll be doing this a lot for these images, let’s just store that color map in a variable that we can use in all of the plots.
grayscale <- gray(seq(0, 1, length = 256))
image(US.data$data1$images[,,1], xaxt='n', yaxt='n', col=grayscale)
If you’re not familiar with ultrasound tongue imaging, what we’re looking at here is the midsagittal surface of the tongue, where the high-frequency sound waves that are emitted from the ultrasound transducer reflect off of the impedance boundary created between the tongue surface and the air inside the mouth. The tongue tip is to the right and the root of the tongue is to the left. There are 12 images for this production (dim(US.data$data1$images)[3] = 12), representing 400 ms of data (US.data$SR$US = 30). Go ahead and take a minute to plot some different images to see how the tongue moves over time!
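As a quick check of those numbers (using only the values just stated, so purely illustrative):

```r
# 12 frames (dim(US.data$data1$images)[3]) at 30 frames/sec (US.data$SR$US)
n.frames <- 12
fps <- 30
n.frames / fps  # 0.4 seconds, i.e. 400 ms of data
```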
One of the most common methods of analyzing tongue shapes in ultrasound data is to fit contours to the tongue surface and extract the \(x\)- and \(y\)-coordinates of these contours. Contour fitting can be accomplished either fully manually, semi-automatically (automatically, but with manual user intervention), or fully automatically. In my experience, semi-automated methods provide the best trade-off between accuracy and time commitment, but even those methods often take a lot of time. As such, tongue contour fitting is very rarely done in a dynamic way, i.e. with contours fitted to each image frame in an ultrasound recording. Rather, single contours are taken from individual tokens at a specific time point, and the shapes are compared, e.g., between different sounds or different contexts.
For this data set, I’ve included three contours that were fit semi-automatically using the EdgeTrak program. These contours were taken at 25%, 50%, and 75% of the vowel interval in the imitated “pain” token included here. A trace of the palate shape is also included in US.data$data1$contours$palate, which can be plotted along with a tongue shape in order to contextualize where the tongue is in relation to the palate. Let’s first plot one of the contours (50% of the vowel interval) to see what it looks like:
plot(US.data$data1$contours$coords$x50, US.data$data1$contours$coords$y50, type='l')
Yep, it looks like a tongue alright! But it doesn’t really tell us anything interesting without some more context, so we need to plot all of the tongue contours together along with the palate contour. Only one small detail: each of these elements spans a different range of values. In order to compare all of the contours, we need to put them on the same ranges for the two axes. We can do this by finding the overall minimum and maximum values for each dimension using the range() function. Let’s first see an example for the range of data in the \(x\)-dimension:
xrange <- range(c(
US.data$data1$contours$palate$x,
US.data$data1$contours$coords$x25,
US.data$data1$contours$coords$x50,
US.data$data1$contours$coords$x75
))
xrange
## [1] -33.625 31.075
This means that, when considering the \(x\)-dimension values across the contours for all three tongue positions and also the palate, the smallest value is -33.625 and the largest value is 31.075. We can use this information to set the xlim argument in the plot() function, and then do the same for the \(y\)-dimension values.
Your goal for this exercise is to provide a bit of context by plotting the palate along with all three tongue shapes in a single figure. You should have learned enough in the tutorial by this point to have a good idea about how to figure this one out, so I won’t be providing any additional help here.
Plot the palate in black, the tongue shape at 25% in blue, the tongue shape at 50% in purple, and the tongue shape at 75% in red. Make sure to change the limits of the axes to match the range of all of the data (do this in a programmatic way, rather than just “hard coding” the values). Also, change the axis titles to match the figure below.
Good luck… and have fun!
Hint: I used the following code to add the legend:
legend("bottomright", legend=c("25%","50%","75%"), col=c("blue","purple","red"), lty=1)
One issue with measuring tongue contours is interpreting the \(x\)- and \(y\)-dimension values that are generated by whichever tongue contour fitting program you use: you need to be aware of whether you are dealing with Cartesian or polar coordinates in order to interpret the data and take appropriate measurements. If you’re not familiar with these concepts (or if you’re not familiar with how they apply to ultrasound tongue imaging), let’s look at an example of the problem.
Here is a typical ultrasound setup for speech data recordings, using a transducer stabilization headset developed by Donald Derrick:
The coordinates that are exported by EdgeTrak are in Cartesian space, meaning that the \(x\)- and \(y\)-dimensions are related to each other in a linear way: an increase of 1 unit in the \(x\) dimension is the same distance in the space as an increase of 1 unit in the \(y\) dimension. It is reasonable for EdgeTrak to export coordinates in this way because the actual image that is being processed by the software (and which the user interacts with) is a two dimensional Cartesian plane: a change of 1 pixel to the right or left is the same distance in the image space as a change of 1 pixel up or down. So these Cartesian coordinates interpret image data in a linear plane like this:
However, this is not how ultrasound imaging works. Reflections from the ultrasound transducer do not propagate in a linear space; rather, reflections are calculated along radial lines. In other words, image data from the transducer is actually obtained in a polar plane like this:
What this means for your interpretation of the data, and for your choice of measurements and analyses, is that the position of the tongue means different things at different parts of the image. In other words, the polar plane is a non-linear coordinate space, which means, for example, that you can’t build certain statistical models (like SSANOVAs, GAMMs, or functional PCA) directly on the Cartesian coordinates.
So what can you do? Just give up and try something else?
Nah. Just convert between the two coordinate systems!
The idea here is to take any given point in a Cartesian space and ask the question: if this point were somewhere along the circumference of a circle, what would be the angle of that point along the entire circumference and how large would the circle be (i.e. what is its radius)? Typically, this conversion uses a polar system where an angle of 0\(^\circ\) extends to the right from the polar origin, and the angle increases in a counter-clockwise manner.
Let’s look at an example. The image below is a two dimensional image (Cartesian space). There are two dots placed on this image. In this Cartesian space, we can describe the red dot as being higher (increase along the \(y\) dimension) and slightly to the right (slight increase in the \(x\) dimension), in comparison to the green dot. But when describing the relative position of these two dots in polar space (represented by the red circles and the blue radial lines), we can describe the red dot as having a larger polar angle (60\(^\circ\) vs 30\(^\circ\)), and being located at a larger distance from the origin (twice the distance), in comparison to the green dot. These two polar coordinates are expressed as \(\theta\) (the “polar angle” or the “angular coordinate”) and \(r\) (the “radial distance”, the “radial coordinate”, or simply the “radius”).
Our task then is to brush up a bit on our trigonometry and convert Cartesian coordinates to polar coordinates, treating the Cartesian point as lying at one end of the hypotenuse of a right triangle, with the origin (here the ultrasound transducer) located at the other end:
When considering any points \(x\) and \(y\) in Cartesian space, the radial coordinate (\(r\)) can then be calculated as:
r <- sqrt(x^2 + y^2)
And the angular coordinate (\(\theta\)) can be calculated as:
th <- atan2(y, x)
That’s it! This will calculate the angular coordinate in radians, but it may be easier to interpret the result if we convert to degrees, which can be done by multiplying by 180 and dividing by \(\pi\):
th <- th*180/pi
Let’s go ahead and put all of that into a small function that we can use to convert from Cartesian coordinates to polar coordinates. We’ll add in an optional argument that can be used to automatically convert from radians to degrees:
cart2polar <- function(x, y, degrees=F) {
r <- sqrt(x^2 + y^2)
th <- atan2(y, x)
if (degrees) {
th <- th*180/pi
}
return(data.frame(r, th))
}
Let’s not stop there! What if we want to convert back from polar coordinates to Cartesian coordinates? Given any coordinates \(r\) and \(\theta\) in polar space, the \(x\) coordinate can be calculated as:
x <- r*cos(th)
And the \(y\) coordinate can be calculated as:
y <- r*sin(th)
If the angular coordinate is in degrees instead of radians, we simply need to multiply \(\theta\) by \(\pi\) and divide by 180:
x <- r*cos(th*pi/180)
y <- r*sin(th*pi/180)
Let’s now put all of that together into a similar function to convert polar coordinates to Cartesian coordinates, again adding an optional argument to use if the angular coordinate is in degrees instead of radians:
polar2cart <- function(r, th, degrees=F) {
if (degrees) {
x <- r*cos(th*pi/180)
y <- r*sin(th*pi/180)
} else {
x <- r*cos(th)
y <- r*sin(th)
}
return(data.frame(x, y))
}
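As a quick sanity check (my own addition, using a 3-4-5 right triangle), converting to polar and back should recover the original point; the code below inlines the same formulas as the two functions, so it runs on its own:

```r
# Round-trip check with a 3-4-5 right triangle (same formulas as above)
x <- 3; y <- 4
r  <- sqrt(x^2 + y^2)        # 5
th <- atan2(y, x) * 180/pi   # ~53.13 degrees
x2 <- r * cos(th * pi/180)   # back to 3
y2 <- r * sin(th * pi/180)   # back to 4
round(c(r, x2, y2), 6)       # 5 3 4
```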
Now we can use the cart2polar() function to convert the original Cartesian coordinates of the tongue contours into polar coordinates. We’ll first create a field polar within US.data$data1$contours, convert the contour coordinates, and add them to that newly created field. Let’s first do that for the 25% vowel time point:
US.data$data1$contours$polar <- c()
pol25 <- cart2polar(US.data$data1$contours$coords$x25,
US.data$data1$contours$coords$y25,
degrees=T)
US.data$data1$contours$polar$r25 <- pol25$r
US.data$data1$contours$polar$th25 <- pol25$th
Follow that? Now we can do the same thing for the other two time points:
pol50 <- cart2polar(US.data$data1$contours$coords$x50,
US.data$data1$contours$coords$y50,
degrees=T)
US.data$data1$contours$polar$r50 <- pol50$r
US.data$data1$contours$polar$th50 <- pol50$th
pol75 <- cart2polar(US.data$data1$contours$coords$x75,
US.data$data1$contours$coords$y75,
degrees=T)
US.data$data1$contours$polar$r75 <- pol75$r
US.data$data1$contours$polar$th75 <- pol75$th
And lastly we’ll do the same thing for the palate trace and add it to a new field polpalate:
polpal <- cart2polar(US.data$data1$contours$palate$x,
US.data$data1$contours$palate$y,
degrees=T)
US.data$data1$contours$polpalate <- polpal
So what do the data look like when converted to polar coordinates? Let’s see!
xrange <- range(c(
US.data$data1$contours$polpalate$th,
US.data$data1$contours$polar$th25,
US.data$data1$contours$polar$th50,
US.data$data1$contours$polar$th75
))
yrange <- range(c(
US.data$data1$contours$polpalate$r,
US.data$data1$contours$polar$r25,
US.data$data1$contours$polar$r50,
US.data$data1$contours$polar$r75
))
plot(NA, xlim=rev(xrange), ylim=yrange,
xlab="Angular coordinate (theta)",
ylab="Radial coordinate (r)")
lines(US.data$data1$contours$polpalate$th, US.data$data1$contours$polpalate$r)
lines(US.data$data1$contours$polar$th25, US.data$data1$contours$polar$r25, col='blue')
lines(US.data$data1$contours$polar$th50, US.data$data1$contours$polar$r50, col='purple')
lines(US.data$data1$contours$polar$th75, US.data$data1$contours$polar$r75, col='red')
Well now… that looks quite a bit different, doesn’t it? Now that you have the tongue contours in polar coordinates, you can submit them to the statistical model of your choice!
Before we finish off this section, we’ll do one more thing to better understand the conversion that we just did. Let’s create a set of lines in polar space and convert them to Cartesian coordinates in order to understand how the original tongue contours (in Cartesian space) relate to what the ultrasound transducer is actually imaging (in polar space).
Let’s plot the polar data again, create a set of \(\theta\) lines from 0\(^\circ\) to 180\(^\circ\), and then add those to the plot:
plot(NA, xlim=rev(xrange), ylim=yrange,
xlab="Angular coordinate (theta)",
ylab="Radial coordinate (r)")
lines(US.data$data1$contours$polpalate$th, US.data$data1$contours$polpalate$r)
lines(US.data$data1$contours$polar$th25, US.data$data1$contours$polar$r25, col='blue')
lines(US.data$data1$contours$polar$th50, US.data$data1$contours$polar$r50, col='purple')
lines(US.data$data1$contours$polar$th75, US.data$data1$contours$polar$r75, col='red')
pol.lines <- list(th=seq(0,180,by=10),
r1=rep(10,19),
r2=rep(100,19))
for (line in 1:19){
lines(
c(pol.lines$th[line],pol.lines$th[line]),
c(pol.lines$r1[line],pol.lines$r2[line]),
lty=2)
}
Now let’s use our polar2cart() function to convert those \(\theta\) lines to Cartesian coordinates:
cart.lines1 <- polar2cart(pol.lines$r1,
pol.lines$th,
degrees=T)
cart.lines2 <- polar2cart(pol.lines$r2,
pol.lines$th,
degrees=T)
And now we’ll create a new plot with the original (Cartesian) tongue/palate data and add the \(\theta\) lines that we’ve just converted to Cartesian coordinates:
xrange <- range(c(
US.data$data1$contours$palate$x,
US.data$data1$contours$coords$x25,
US.data$data1$contours$coords$x50,
US.data$data1$contours$coords$x75
))
yrange <- range(c(
US.data$data1$contours$palate$y,
US.data$data1$contours$coords$y25,
US.data$data1$contours$coords$y50,
US.data$data1$contours$coords$y75
))
plot(NA, xlim=xrange, ylim=yrange,
xlab="<-- Posterior (mm) | Anterior (mm) -->",
ylab="<-- Inferior (mm) | Superior (mm) -->")
lines(US.data$data1$contours$palate$x, US.data$data1$contours$palate$y)
lines(US.data$data1$contours$coords$x25, US.data$data1$contours$coords$y25, col='blue')
lines(US.data$data1$contours$coords$x50, US.data$data1$contours$coords$y50, col='purple')
lines(US.data$data1$contours$coords$x75, US.data$data1$contours$coords$y75, col='red')
for (x in 1:19){
lines(
c(cart.lines1$x[x],cart.lines2$x[x]),
c(cart.lines1$y[x],cart.lines2$y[x]),
lty=2)
}
At the end of Part #1, we looked at how to work with tongue contour data in R. I mentioned that fitting tongue contours (in some other program, such as EdgeTrak) is “one of the most common methods” of analyzing ultrasound data… but it certainly doesn’t have to be the only one! In the rest of the tutorial, we’ll explore some ways of analyzing ultrasound tongue images directly in R without having to fit tongue contours. In this first section, we’ll analyze image change along a “line of interest”, and in the final section we’ll analyze change in the entire image.
For these analyses, we’ll use a separate set of data from the same participant, i.e. US.data$data2. Let’s have a listen to the audio recording:
sr <- US.data$SR$audio
audio <- sound::as.Sample(as.vector(US.data$data2$audio), sr)
sound::play(audio)
Ahhh, an /hVd/ word list! That’s pretty standard. This was a reference recording for the imitation study, but I ended up not analyzing these data in any way…
…which means I never fit any tongue contours. :-(
Is there anything we can do with the image data anyway? Yes, of course!
Let’s first take a look at what kind of overall tongue movement we’re dealing with across the images in this recording. To do this, we’ll apply the var function over the first two dimensions of the combined image stack—which will, somewhat counter-intuitively, give us the variance across the third dimension (i.e. all of the images):
varimg <- apply(US.data$data2$images, 1:2, var)
image(varimg, col=grayscale)
In this image, we’re looking for areas of white to indicate regions where the images change a lot over the recording—in other words, areas of the image where the tongue is moving. What we’re going to do in this section is pick a part of the tongue surface and place a line of interest, and then see where the tongue moves along that line over the course of the recording.
For our example, we’re going to draw a line over the top part of the root of the tongue, in order to capture the degree of tongue retraction.
To create the line of interest, we’ll use our handy little locator() function again and supply 2 as an input because we’re (again) only going to click twice:
image(varimg, col=grayscale)
coords <- locator(2)
lines(coords$x, coords$y, col='red', lwd=2)
points(coords$x[1], coords$y[1], col='cyan', pch=1, cex=2)
points(coords$x[2], coords$y[2], col='orange', pch=1, cex=2)
NB: You need to be aware of the order in which you made the selection (we will see why later). To visualize the order in which I selected these particular points, I have used a cyan circle to indicate the first selection (anterior to the tongue surface) and an orange circle to indicate the second selection (posterior to the tongue surface).
Now we need to “simply” make a measurement of the intensities of the pixels that cross this line. Why is “simply” in quotes? Because, as we will see now, working with images can be very tricky indeed, in part because of different traditions and conventions for the image coordinate system: one convention (and, thus, the functions written by people who follow this convention) places the origin of the coordinate system in the bottom-left of the image, while another convention places the origin at the top-left. And the coordinate system assumed by the data you’re working with may match one (or neither) of these… frustrating! So, in order to make sure that our line traverses the correct pixels of the image, we will need to do a bit of investigation. Inside the documentation provided by ?image, we find the following:
“Notice that image interprets the z matrix as a table of f(x[i], y[j]) values, so that the x axis corresponds to row number and the y axis to column number, with column 1 at the bottom, i.e. a 90 degree counter-clockwise rotation of the conventional printed layout of a matrix.”
Uh oh. We’re used to thinking of matrices the other way around: with \(x\) values corresponding to columns and \(y\) values corresponding to rows! Not a problem, we just need to make sure that we associate the values in coords$x with rows (i.e. dim(varimg)[1]) and those in coords$y with columns (i.e. dim(varimg)[2]).
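Here’s a toy demonstration of that rotation (illustrative only, not the ultrasound data): a matrix with more rows than columns produces a plot that is wider than it is tall, because rows map to the x axis.

```r
# A 4x2 matrix: 4 rows -> 4 steps along x, 2 columns -> 2 steps along y
m <- matrix(0, nrow = 4, ncol = 2)
m[4, 1] <- 1                 # element in the last row, first column...
image(m, col = gray(0:1))    # ...is drawn at the far right, bottom of the plot
```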
You may have noticed that the image() function plots the image along a 0-1 range for both dimensions. This means that the coordinates we currently have are proportions of the total dimensions of the image. We can use that information to construct a new set of coordinates that are the nearest pixels to the points that we previously selected using coords <- locator(2). We can do this by rounding the values we get after multiplying the coordinates (proportions) by the total dimensions of the image (making sure that we are associating \(x\) values with rows and \(y\) values with columns):
coords2 <- c()
coords2$x <- round(coords$x * dim(varimg)[1])
coords2$y <- round(coords$y * dim(varimg)[2])
coords2
## $x
## [1] 264 114
##
## $y
## [1] 193 221
Nice. So the nearest pixels to the points I selected are at (264,193) for the first point and (114,221) for the second point. Now that we have the pixels that are at the ends of the line, our goal now is to find the nearest pixels that fall along the line extending between those two end points. We first need to find out how long the line is (in pixels), which we can do by calculating the Euclidean distance (and rounding to the nearest pixel):
edist <- round(
sqrt(
(coords2$x[1] - coords2$x[2])^2 + (coords2$y[1] - coords2$y[2])^2
)
)
edist
## [1] 153
So the line extending from pixel (264,193) to pixel (114,221) is 153 pixels long. We can use that information to extend a line of coordinates between the two end points that has 153 values. We will do this for both the \(x\) and \(y\) dimensions:
xx <- seq(coords2$x[1], coords2$x[2], length.out=edist)
yy <- seq(coords2$y[1], coords2$y[2], length.out=edist)
If we round these values, then we get the nearest pixels that intersect a line that is 153 pixels long and extends from pixel (264,193) to pixel (114,221):
head(cbind(round(xx),round(yy))); tail(cbind(round(xx),round(yy)))
## [,1] [,2]
## [1,] 264 193
## [2,] 263 193
## [3,] 262 193
## [4,] 261 194
## [5,] 260 194
## [6,] 259 194
## [,1] [,2]
## [148,] 119 220
## [149,] 118 220
## [150,] 117 220
## [151,] 116 221
## [152,] 115 221
## [153,] 114 221
Now we have the coordinates of a line that we can use to take a measurement! Let’s use the first image in the data set as a test to see what kind of measurement might be good to take. We’ll first plot the image along with the line of interest:
image(US.data$data2$images[,,1], col=grayscale)
lines(coords$x, coords$y, col='red', lwd=2)
points(coords$x[1], coords$y[1], col='cyan', pch=1, cex=2)
points(coords$x[2], coords$y[2], col='orange', pch=1, cex=2)
So the tongue surface should show up somewhere along that line as larger values (higher pixel intensity). Let’s get the intensity values at the pixel coordinates (in xx and yy) that fall along the line and plot the results:
vals <- numeric(edist)
for (val in 1:edist) {
vals[val] <- US.data$data2$images[round(xx)[val],round(yy)[val],1]
}
plot(vals, type='l')
There it is! The tongue surface is that peak around index 60 (well, technically, just before the peak, since the peak is the reflection at the boundary between the tongue surface and the air). Remember that the first selection point at pixel (264,193) was anterior to the tongue surface and the second selection point at pixel (114,221) was posterior to it. A lower value for the location of the peak therefore corresponds to a more anterior tongue root (i.e. closer to the first selection point), and a higher value corresponds to a more posterior tongue root (i.e. closer to the second selection point). We can thus interpret this dimension as a metric of tongue retraction, since an increase corresponds to a more retracted tongue position.
To make our measurement, we simply need to identify a value along this vector of 153 values, which will correspond to the location of the peak. One approach is to find the location of the highest value; you can do this using which.max(vals). However, this approach may be a bit too discrete for our purposes, yielding a signal that has poor temporal and spatial resolution. Another approach is to take a measurement that is more continuous than simply locating individual pixels, i.e. a measurement that will allow the estimation of values between pixels. One nice way to do this is to calculate the center of gravity of the values:
cog <- sum(seq(1,edist) * vals) / sum(vals)
cog
## [1] 71.14161
We can see that this is a value between 71 and 72, but closer to 71. The continuous nature of this measurement will give us better resolution for our tongue retraction measure.
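To see the difference between the two approaches, here’s a toy example (made-up values, not the tutorial data):

```r
# A small asymmetric intensity profile: the peak is at index 4, but more
# "mass" sits to its right than to its left
vals <- c(0, 1, 4, 5, 5, 1, 0)
which.max(vals)                          # 4: whole indices only (discrete)
sum(seq_along(vals) * vals) / sum(vals)  # 4.0625: falls between pixels 4 and 5
```

The center of gravity shifts smoothly as the profile changes, whereas which.max() can only jump in steps of one pixel.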
Now that we have a measure to use, we can apply it to all of the images in the data set. If you inspect the code below, you’ll see that we’re simply following all of the steps that we’ve just gone over, but applying them to each of the 251 images in the data set.
imgnum <- dim(US.data$data2$images)[3]
cogs <- numeric(imgnum)
for (img in 1:imgnum){
this.img <- US.data$data2$images[,,img]
vals <- numeric(edist)
for (val in 1:edist) {
vals[val] <- this.img[round(xx)[val],round(yy)[val]]
}
cogs[img] <- sum(seq(1,edist) * vals) / sum(vals)
}
plot(cogs, type='l')
And there we have it, folks: a time-varying tongue retraction signal for the whole recording! The values aren’t really interpretable, though, since they’re just positions along the length of an arbitrary line of interest that we selected. So let’s go ahead and scale the values so that they represent a percentage of the entire range of values in the recording:
cogs <- 100 * (cogs - min(cogs)) / (max(cogs) - min(cogs))
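As a toy illustration (made-up values, not the cogs data) of what this min-max scaling does: the smallest value maps to 0%, the largest to 100%, and everything in between scales proportionally.

```r
x <- c(3, 5, 9)
100 * (x - min(x)) / (max(x) - min(x))  # 0.00000 33.33333 100.00000
```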
Let’s see if our signal makes sense by comparing some back vowels and front vowels. What is the order of the /hVd/ words in the recording?
US.data$data2$segments$word
## [1] head hood heed heard haired who'd hard hoard had hud
## [11] hid hod
## Levels: had haired hard head heard heed hid hoard hod hood hud who'd
Okay, so let’s pick out two words with front vowels and compare them to two words with back vowels. We’ll choose “heed” and “hid” for the front vowels, which are words 3 (“heed”) and 11 (“hid”) in the list. And we’ll choose “hard” and “hoard” for the back vowels, which are words 7 (“hard”) and 8 (“hoard”). We’ll use these values to index the time points for the start point of each word in US.data$data2$segments$wstart and the end point of each word in US.data$data2$segments$wend.
Let’s first get the time points for the two front /hVd/ words and plot them against normalized time in red lines:
t1 <- round(US.data$data2$segments$wstart[3] * US.data$SR$US)
t2 <- round(US.data$data2$segments$wend[3] * US.data$SR$US)
chunk <- cogs[t1:t2]
wtime <- seq(0, 100, length.out=length(chunk))
plot(wtime, chunk, type='l', col='red', ylim=range(cogs),
xlab="Normalized time (%)", ylab="Tongue retraction (%)")
t1 <- round(US.data$data2$segments$wstart[11] * US.data$SR$US)
t2 <- round(US.data$data2$segments$wend[11] * US.data$SR$US)
chunk <- cogs[t1:t2]
wtime <- seq(0, 100, length.out=length(chunk))
points(wtime, chunk, type='l', col='red')
Now let’s do the same thing for the two back /hVd/ words and add their signals in blue lines:
t1 <- round(US.data$data2$segments$wstart[7] * US.data$SR$US)
t2 <- round(US.data$data2$segments$wend[7] * US.data$SR$US)
chunk <- cogs[t1:t2]
wtime <- seq(0, 100, length.out=length(chunk))
points(wtime, chunk, type='l', col='blue')
t1 <- round(US.data$data2$segments$wstart[8] * US.data$SR$US)
t2 <- round(US.data$data2$segments$wend[8] * US.data$SR$US)
chunk <- cogs[t1:t2]
wtime <- seq(0, 100, length.out=length(chunk))
points(wtime, chunk, type='l', col='blue')
legend("topright", legend=c("front Vs", "back Vs"), col=c("red","blue"), lty=1)
Not too bad! Not only do we have a good (and predictable) separation between the front and back /hVd/ words, but we can see some nice patterns of coarticulation: the tongue is already in position for the vowel during the /h/, and the horizontal tongue position converges at the end of each of the words for the production of /d/.
Another interesting possibility is to apply a purely data-driven approach to images by directly analyzing the changing pixel intensities in ultrasound images, rather than estimating the location of the tongue surface within the images. This approach has the distinct advantage of allowing for dynamic analysis of ultrasound tongue video with a time investment that is many orders of magnitude smaller than fitting and analyzing tongue contours. The disadvantage of this approach is that spatial information is sometimes lost, and the results have to be interpreted with extreme care.
One possible method for this approach is to use principal components analysis (PCA) to explain the maximum amount of image variance in the ultrasound video. PCA is useful for an analytics problem of this sort because we can assume that there will be a high degree of inter-correlation among the pixels in the images: when the tongue moves in a certain direction, a large number of pixels in the image frame will all become bright at the same time, while a large number of pixels will also all become dark at the same time. What this tells us is that any given individual pixel site does not capture much meaningful information about the tongue itself… rather, it is only with a collection of inter-correlated pixel sites that we can capture information about a factor that we are truly interested in (in our case, tongue movement).
We can thus use PCA to our advantage in this case because the model captures individual factors (or components) in the data that are explained by co-variance among individual dimensions (in our case, the pixel sites). The components identified by the model will explain fundamental degrees of freedom represented in the image data that cannot be explained by the individual pixel sites themselves. Let’s take a look at how this works in practice with another set of ultrasound data (US.data$data2).
This data set comes from one repetition of a baseline word list of English /hVd/ words spoken by the same Australian English imitator whose data we used in the previous exercise. Let’s have a listen to the audio:
audio <- sound::as.Sample(as.vector(US.data$data2$audio), sr)
sound::play(audio)
In order to create a PCA model, we need to arrange the data into a two-dimensional matrix in which columns correspond to individual pixel sites (the dimensions of the model) and rows correspond to successive frames in the ultrasound video (the observations of the model). One thing to keep in mind is that a well-fitted PCA model should have more observations than dimensions. In this mini data set there are 251 frames, which means the PCA model will have 251 observations. However, any given ultrasound frame is 690 \(\times\) 470 pixels, which means there are 324,300 individual pixel sites… almost 1300\(\times\) as many dimensions as observations! We would need nearly 1.5 hours of this type of ultrasound data to capture more observations than dimensions.
BUT… there is a simple solution to this problem: we can down-sample the images without losing the important information that we are interested in (this also has the added benefit of cutting down on processing time).
Let’s prepare a matrix that will be used to create a set of re-sampled images at a down-sampling factor of 37, which will give us slightly fewer dimensions than observations:
nframes <- dim(US.data$data2$images)[3]
imfact <- 37
tongue.dat <- matrix(nrow=nframes,
ncol=round(nrow(US.data$data2$images[,,1])/imfact) *
round(ncol(US.data$data2$images[,,1])/imfact))
ncol(tongue.dat)
## [1] 247
At a down-sampling factor of 37, we will have 247 dimensions for our 251 observations. That will allow us to create a well-fitted PCA model from the data. Now let’s downsize each of the images in the data set and save the resulting pixel sites in our new matrix:
for (frame in 1:nframes){
this.frame <- US.data$data2$images[,,frame]
s <- raster::raster(nrow=round(nrow(this.frame)/imfact),
ncol=round(ncol(this.frame)/imfact))
r <- raster::raster(this.frame)
raster::extent(r) <- raster::extent(c(-180, 180, -90, 90))
new.frame <- raster::resample(r, s)
tongue.dat[frame, ] <- raster::as.matrix(new.frame, nrow=1, byrow=T)
}
We can plot one of the down-sampled images (the last image of the set, in this case, since it was the last one to be run within the for loop) to ensure that the re-sampling actually makes sense. First, the original image:
image(this.frame, xaxt='n', yaxt='n', col=grayscale)
Then its down-sampled version:
image(raster::as.matrix(new.frame), xaxt='n', yaxt='n', col=grayscale)
It may not be pretty, but it will work for our exercise! When dealing with data for a real experiment, you will have many more than 251 frames in the ultrasound video, so your down-sampling factor will not need to be nearly this destructive. However, it will always be a good idea to down-sample the images in order to cut down on the processing time required to perform your analysis.
Now that the images have been down-sampled and structured appropriately, we can submit the entire matrix to a PCA model using the prcomp() function:
pca <- prcomp(tongue.dat)
If you take a look at a summary of the PCA model, you will notice there are 247 factors/PCs: exactly the same number as the original dimensions included in the model. This will always be the case with PCA: in order to explain every last bit of image variance (even noise in the image), you actually need 100% of the data. BUT… you don’t need 100% of the data to explain the majority of the image variance, which will be sure to capture the tongue movement (instead of, e.g., image noise). This is where PCA is most powerful: reducing dimensionality to a select number of features that explain the majority of variance in the input dimensions.
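For reference, the summary can be inspected directly (the numbers it prints come from your fitted model):

```r
summary(pca)                    # full table across all 247 PCs
# The same values are also available programmatically via the
# 'importance' matrix returned by summary():
summary(pca)$importance[, 1:3]  # Standard deviation, Proportion of Variance,
                                # and Cumulative Proportion for PC1-PC3
```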
We can determine how many PCs are needed to explain the majority of the variance by referring to the proportion of variance explained by each PC. In the summary of the model, you will see two important values for each PC: “Proportion of Variance” and “Cumulative Proportion”. These refer to the percentage of variance explained independently by each PC, as well as the cumulative percentage of variance explained by the given PC plus all of the preceding PCs. For example, PC1 explains 62.4% of the total image variance on its own… while PC2 explains 12.4% of the total image variance on its own, but 74.8% of the image variance when combined with PC1, etc.
A (somewhat, arguably) common threshold for PC analysis is to retain as many PCs as it takes to explain at least 80% of the total variance. In our case, three PCs are needed to explain 83% of the variance. Since the image variance will almost exclusively be explained by movement of the tongue within the image (the higher-order PCs will explain lesser variance like image noise), we can also interpret this as explaining 83% of all of the movements that the tongue makes for producing these 12 /hVd/ words. So we’ve effectively taken our 247 variables and distilled them down to just three variables that explain independent degrees of freedom in the data set. Pretty cool, huh?
Let’s run a bit of code to figure out this threshold automatically and prepare some variables for later use:
vars <- apply(pca$x, 2, var)
props <- vars / sum(vars)
var.exp <- cumsum(props)
to.keep <- var.exp[var.exp < 0.8]
to.keep <- length(to.keep) + 1
PCs <- paste0('PC', seq_len(to.keep))
Another way of interpreting this result is that there are three primary articulatory degrees of freedom involved in producing this wide range of vowel qualities in Australian English. But what exactly are these degrees of freedom? Can we interpret them in any phonetically meaningful way?
One way to interpret the results is to map the coefficients from the PCA model directly onto the location in the image frame from which the corresponding pixel sites originated. We can then view these coefficient maps to get a spatial interpretation of what each PC means:
coeff.maps <- list()
for (pc in PCs){
coeff.maps[[pc]] <- t(matrix(pca$rotation[,pc],
nrow = round(ncol(US.data$data2$images[,,1])/imfact),
byrow = TRUE))
}
image(coeff.maps$PC1, xaxt='n', yaxt='n', col=grayscale)
Here we can see that there are positive coefficients at the back of the tongue (bright pixels) and negative coefficients at the front of the tongue (dark pixels). This tells us that PC1 probably has something to do with the tongue moving back and forward… tongue anteriority/posteriority?
Let’s do the same for PC2:
image(coeff.maps$PC2, xaxt='n', yaxt='n', col=grayscale)
Now there are negative coefficients at the top of the tongue and positive coefficients at the base of the tongue… something to do with… tongue height?
We can test to see if these interpretations make sense by plotting the PC1 and PC2 scores along with the /hVd/ word labels. Let’s do this for the vowel midpoint. The individual PC scores are stored in the separate columns of pca$x, and the word labels and time points associated with the vowel onset and offset are located in US.data$data2$segments:
midpoints <- rowMeans(US.data$data2$segments[,c('vstart','vend')])
midframes <- round(midpoints * US.data$SR$US)
PC1.mid <- pca$x[midframes, 1]
PC2.mid <- pca$x[midframes, 2]
plot(NA, xlim=range(PC1.mid), ylim=range(PC2.mid), xlab='PC1 score', ylab='PC2 score')
text(PC1.mid, PC2.mid, labels=US.data$data2$segments[,1])
Our interpretations of the PCA coefficient maps seem to be reasonable: PC1 is capturing distinctions of tongue anteriority/posteriority in the data, while PC2 is capturing distinctions of tongue height in the data.
Using these interpretations, we can now look at the dynamic articulation of these vowels by plotting the PC scores within the /hVd/ sequence:
word <- "who'd"
word.num <- which(US.data$data2$segments[,1] == word)
vstart <- US.data$data2$segments$vstart[word.num]
vend <- US.data$data2$segments$vend[word.num]
wstart <- US.data$data2$segments$wstart[word.num]
wend <- US.data$data2$segments$wend[word.num]
int <- round(seq(wstart*US.data$SR$US, wend*US.data$SR$US))
time <- seq(wstart, wend, length.out=length(int))
scores <- pca$x[int,]
PC <- 'PC1'
plot(time, scores[,c(PC)], type='l', xlab='Time (s)', ylab=paste(PC,'score'),
main=paste0('Australian English /hVd/ word: "',word,'"'))
abline(v=vstart, col='blue'); abline(v=vend, col='red')
With the vowel boundaries plotted in blue and red, and keeping in mind the interpretation of PC1 as tongue anteriority/posteriority (positive scores = fronted, negative scores = retracted), we can see that the vowel in the word “who’d” in Australian English seems to be diphthongal: it starts somewhat centralized and then becomes increasingly anterior throughout the vowel interval.
Now, the question to answer for the exercise: is this dynamic pattern due to an inherent property of the vowel, or is it due to anticipatory coarticulation with the coda /d/?
Choose other words in the data set to compare against “who’d” and make similar plots for these words, and make sure to look at both tongue anteriority/posteriority (PC1) and tongue height (PC2). Use these plots to defend your analysis.
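One way to organize the comparison is to wrap the plotting code above in a small function and loop over words and PCs. This is a sketch only: the comparison word “heed” is an assumption about what the /hVd/ word list contains, so substitute whichever labels appear in US.data$data2$segments.

```r
# Sketch: plot the trajectory of one PC score across one /hVd/ word,
# reusing the objects (pca, US.data) defined earlier in the tutorial.
plot.word.PC <- function(word, PC='PC1'){
  word.num <- which(US.data$data2$segments[,1] == word)
  wstart <- US.data$data2$segments$wstart[word.num]
  wend   <- US.data$data2$segments$wend[word.num]
  int  <- round(seq(wstart*US.data$SR$US, wend*US.data$SR$US))
  time <- seq(wstart, wend, length.out=length(int))
  plot(time, pca$x[int, PC], type='l', xlab='Time (s)',
       ylab=paste(PC, 'score'), main=paste0('"', word, '"'))
  abline(v=US.data$data2$segments$vstart[word.num], col='blue')
  abline(v=US.data$data2$segments$vend[word.num], col='red')
}

# Compare two words on both PCs in a 2x2 grid ("heed" is a hypothetical label):
par(mfrow=c(2, 2))
for (w in c("who'd", "heed")){
  for (pc in c('PC1', 'PC2')) plot.word.PC(w, pc)
}
```

Looking at PC1 and PC2 side by side for each word should make it clear whether the late movement toward /d/ is shared across vowels (coarticulation) or specific to “who’d” (an inherent property of the vowel).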